Wrangling data in R

Leonard Blaschek

“80% of data science is cleaning”

— Ancient proverb

A quick word about myself

Types of data

  1. Excel sheets
  2. Delimited text files
  3. Folders of raw data
  1. Insane, lawless text files
  2. Proprietary formats

Tidy Data1

  • No white space
  • One observation per row
  • One variable per column
  • No information in formatting



Spot the untidyness!

Minimally tidied-up

R fundamentals

ggplot()            # function
ggplot              # object
996107              # number
"ggplot"            # string
?ggplot()           # show help page 
library(tidyverse)  # use library() to load the tidyverse package

Package vignettes

?readr # navigate to package index and then vignettes

Function help pages

?read_tsv()

Arguments without default need to be supplied.

File paths and operating systems

Use absolute paths:

  • Windows (ctrl+c): C:/Users/leonard/Desktop
    • Windows (silly): C:\Users\leonard\Desktopflip the slashes!
  • MacOS (cmd+opt+c): /Users/leonard/Desktop
  • Linux (ctrl+alt+c): /home/leonard/Desktop



or relative paths from the working directory (read up on Rstudio projects, setwd() and getwd())

The Pipe: Using function output as input





library(tidyverse)
nrow(read_csv("data/cleaned_example.csv"))
[1] 32
"data/cleaned_example.csv" |>
  read_csv() |>
  nrow()
[1] 32



The pipe is typed as either %>% or |>1

Delimited text files

readr::read_csv()          #Comma delimited, decimal points
readr::read_csv2()         #Semicolon delimited, decimal comma

readr::read_tsv()          #Tab delimited

readr::read_delim()        #Pick your own delimiter

readxl::read_xlsx()        #Excel files

Folders of files

data_files <- list.files(
  path = "/path/to/folder",     #Folder containing data files
  pattern = ".tsv",             #Pattern that matches files of interest
  recursive = TRUE,             #Look in sub-folders
  full.names = TRUE             #Return full file path
)

Folders of files

data_files <- list.files(
  path = "/path/to/folder",     #Folder containing data files
  pattern = ".tsv",             #Pattern that matches files of interest
  recursive = TRUE,             #Look in sub-folders
  full.names = TRUE             #Return full file path
)

data <- read_tsv(data_files, id = "path")

Exercise 1

https://github.com/leonardblaschek/data_wrangling

Data cleaning

Data cleaning

data <- read_csv("data/cleaned_example.csv")

data
# A tibble: 32 × 7
   Sample   Replicate    Abs `Abs minus blank` Amount (based on standa…¹ Weighed
   <chr>        <dbl>  <dbl>             <dbl>                     <dbl>   <dbl>
 1 Blank           NA 0.0595            NA                        NA          NA
 2 Zfp2-1 1         1 0.267              0.208                     1.06        8
 3 <NA>             2 0.26               0.200                     1.01        5
 4 <NA>             3 0.224              0.164                     0.751       5
 5 Zfp2-1 2         1 0.275              0.216                     1.12        6
 6 <NA>             2 0.233              0.174                     0.816       5
 7 <NA>             3 0.298              0.238                     1.28        6
 8 Zfp2-1 3         1 0.209              0.150                     0.644       6
 9 <NA>             2 0.285              0.226                     1.19        6
10 <NA>             3 0.275              0.216                     1.12        6
# ℹ 22 more rows
# ℹ abbreviated name: ¹​`Amount (based on standard curve)`
# ℹ 1 more variable: `Weight%` <dbl>

Systematically missing values: fill()

data_filled <- data |>
  fill(Sample, .direction = "down")

data_filled
# A tibble: 32 × 7
   Sample   Replicate    Abs `Abs minus blank` Amount (based on standa…¹ Weighed
   <chr>        <dbl>  <dbl>             <dbl>                     <dbl>   <dbl>
 1 Blank           NA 0.0595            NA                        NA          NA
 2 Zfp2-1 1         1 0.267              0.208                     1.06        8
 3 Zfp2-1 1         2 0.26               0.200                     1.01        5
 4 Zfp2-1 1         3 0.224              0.164                     0.751       5
 5 Zfp2-1 2         1 0.275              0.216                     1.12        6
 6 Zfp2-1 2         2 0.233              0.174                     0.816       5
 7 Zfp2-1 2         3 0.298              0.238                     1.28        6
 8 Zfp2-1 3         1 0.209              0.150                     0.644       6
 9 Zfp2-1 3         2 0.285              0.226                     1.19        6
10 Zfp2-1 3         3 0.275              0.216                     1.12        6
# ℹ 22 more rows
# ℹ abbreviated name: ¹​`Amount (based on standard curve)`
# ℹ 1 more variable: `Weight%` <dbl>

Filtering rows: filter()

data_filtered <- data_filled |>
  filter(Sample != "Blank")

data_filtered
# A tibble: 30 × 7
   Sample   Replicate   Abs `Abs minus blank` Amount (based on standar…¹ Weighed
   <chr>        <dbl> <dbl>             <dbl>                      <dbl>   <dbl>
 1 Zfp2-1 1         1 0.267             0.208                      1.06        8
 2 Zfp2-1 1         2 0.26              0.200                      1.01        5
 3 Zfp2-1 1         3 0.224             0.164                      0.751       5
 4 Zfp2-1 2         1 0.275             0.216                      1.12        6
 5 Zfp2-1 2         2 0.233             0.174                      0.816       5
 6 Zfp2-1 2         3 0.298             0.238                      1.28        6
 7 Zfp2-1 3         1 0.209             0.150                      0.644       6
 8 Zfp2-1 3         2 0.285             0.226                      1.19        6
 9 Zfp2-1 3         3 0.275             0.216                      1.12        6
10 Zfp2-1 4         1 0.17              0.110                      0.364       5
# ℹ 20 more rows
# ℹ abbreviated name: ¹​`Amount (based on standard curve)`
# ℹ 1 more variable: `Weight%` <dbl>

Comparisons

  • A > B
  • A == A
  • A != B
  • A %in% c(B, C)
  • !, &, |
  • is.na(A)
  • … see ?filter()

Selecting columns: select()

data_selected <- data_filtered |>
  select(
    # select columns by name or position; use ["New name" = old name] to rename columns
    Sample,
    Replicate,
    "Proportion" = `Weight%` # wrap non-syntactic names in backticks (``)
  )

data_selected
# A tibble: 30 × 3
   Sample   Replicate Proportion
   <chr>        <dbl>      <dbl>
 1 Zfp2-1 1         1     0.132 
 2 Zfp2-1 1         2     0.202 
 3 Zfp2-1 1         3     0.150 
 4 Zfp2-1 2         1     0.186 
 5 Zfp2-1 2         2     0.163 
 6 Zfp2-1 2         3     0.214 
 7 Zfp2-1 3         1     0.107 
 8 Zfp2-1 3         2     0.198 
 9 Zfp2-1 3         3     0.186 
10 Zfp2-1 4         1     0.0729
# ℹ 20 more rows

Separating compound variables: separate_*()

data_sep <- data_selected |>
  separate_wider_delim(
    Sample,
    names = c("Genotype", "Biorep"),
    delim = " "
  )

data_sep
# A tibble: 30 × 4
   Genotype Biorep Replicate Proportion
   <chr>    <chr>      <dbl>      <dbl>
 1 Zfp2-1   1              1     0.132 
 2 Zfp2-1   1              2     0.202 
 3 Zfp2-1   1              3     0.150 
 4 Zfp2-1   2              1     0.186 
 5 Zfp2-1   2              2     0.163 
 6 Zfp2-1   2              3     0.214 
 7 Zfp2-1   3              1     0.107 
 8 Zfp2-1   3              2     0.198 
 9 Zfp2-1   3              3     0.186 
10 Zfp2-1   4              1     0.0729
# ℹ 20 more rows

Wide and long data: pivot_*()

data_wide <- data_sep |>
  pivot_wider(
    names_from = Genotype,
    values_from = Proportion
  )

data_wide
# A tibble: 15 × 4
   Biorep Replicate `Zfp2-1` `Zfp2-2`
   <chr>      <dbl>    <dbl>    <dbl>
 1 1              1   0.132    0.186 
 2 1              2   0.202    0.266 
 3 1              3   0.150    0.274 
 4 2              1   0.186    0.203 
 5 2              2   0.163    0.134 
 6 2              3   0.214    0.187 
 7 3              1   0.107    0.188 
 8 3              2   0.198    0.219 
 9 3              3   0.186    0.223 
10 4              1   0.0729   0.164 
11 4              2   0.175    0.221 
12 4              3   0.150    0.176 
13 5              1   0.135    0.0827
14 5              2   0.178    0.205 
15 5              3   0.132    0.212 
data_long <- data_wide |>
  pivot_longer(
    c(`Zfp2-1`, `Zfp2-2`),
    names_to = "Genotype",
    values_to = "Proportion"
  )

data_long
# A tibble: 30 × 4
   Biorep Replicate Genotype Proportion
   <chr>      <dbl> <chr>         <dbl>
 1 1              1 Zfp2-1        0.132
 2 1              1 Zfp2-2        0.186
 3 1              2 Zfp2-1        0.202
 4 1              2 Zfp2-2        0.266
 5 1              3 Zfp2-1        0.150
 6 1              3 Zfp2-2        0.274
 7 2              1 Zfp2-1        0.186
 8 2              1 Zfp2-2        0.203
 9 2              2 Zfp2-1        0.163
10 2              2 Zfp2-2        0.134
# ℹ 20 more rows

Adding rows: bind_rows()

wt_data <- read_tsv("data/WT.tsv")

wt_data
# A tibble: 15 × 4
   Genotype Biorep Replicate Proportion
   <chr>     <dbl>     <dbl>      <dbl>
 1 Col-0         1         1       0.17
 2 Col-0         1         2       0.22
 3 Col-0         1         3       0.16
 4 Col-0         2         1       0.13
 5 Col-0         2         2       0.13
 6 Col-0         2         3       0.21
 7 Col-0         3         1       0.3 
 8 Col-0         3         2       0.27
 9 Col-0         3         3       0.27
10 Col-0         4         1       0.12
11 Col-0         4         2       0.18
12 Col-0         4         3       0.15
13 Col-0         5         1       0.26
14 Col-0         5         2       0.26
15 Col-0         5         3       0.21
data_sep
# A tibble: 30 × 4
   Genotype Biorep Replicate Proportion
   <chr>    <chr>      <dbl>      <dbl>
 1 Zfp2-1   1              1     0.132 
 2 Zfp2-1   1              2     0.202 
 3 Zfp2-1   1              3     0.150 
 4 Zfp2-1   2              1     0.186 
 5 Zfp2-1   2              2     0.163 
 6 Zfp2-1   2              3     0.214 
 7 Zfp2-1   3              1     0.107 
 8 Zfp2-1   3              2     0.198 
 9 Zfp2-1   3              3     0.186 
10 Zfp2-1   4              1     0.0729
# ℹ 20 more rows

Adding rows: bind_rows()

try(
  data_bound <- data_sep |>
    bind_rows(wt_data)
)
Error in bind_rows(data_sep, wt_data) : 
  Can't combine `..1$Biorep` <character> and `..2$Biorep` <double>.



To bind or join data frames, columns need to have compatible types

wt_data <- read_tsv(
  "data/WT.tsv",
  col_types = "ccnn"
)

data_bound <- data_sep |>
  bind_rows(wt_data)

slice_sample(data_bound, n = 10)
# A tibble: 10 × 4
   Genotype Biorep Replicate Proportion
   <chr>    <chr>      <dbl>      <dbl>
 1 Zfp2-2   4              1      0.164
 2 Col-0    2              2      0.13 
 3 Col-0    4              2      0.18 
 4 Zfp2-2   2              2      0.134
 5 Zfp2-2   1              2      0.266
 6 Zfp2-2   3              2      0.219
 7 Col-0    1              2      0.22 
 8 Zfp2-2   1              3      0.274
 9 Zfp2-2   5              3      0.212
10 Col-0    3              1      0.3  

Adding columns: *_join()

genetic_data <- read_tsv("data/genetics.tsv")

genetic_data
# A tibble: 3 × 2
  Genotype Transgenic
  <chr>    <chr>     
1 Col-0    No        
2 Zfp2-1   Yes       
3 Zfp2-2   Yes       
data_joined <- data_bound |>
  left_join(genetic_data)

data_joined
# A tibble: 45 × 5
   Genotype Biorep Replicate Proportion Transgenic
   <chr>    <chr>      <dbl>      <dbl> <chr>     
 1 Zfp2-1   1              1     0.132  Yes       
 2 Zfp2-1   1              2     0.202  Yes       
 3 Zfp2-1   1              3     0.150  Yes       
 4 Zfp2-1   2              1     0.186  Yes       
 5 Zfp2-1   2              2     0.163  Yes       
 6 Zfp2-1   2              3     0.214  Yes       
 7 Zfp2-1   3              1     0.107  Yes       
 8 Zfp2-1   3              2     0.198  Yes       
 9 Zfp2-1   3              3     0.186  Yes       
10 Zfp2-1   4              1     0.0729 Yes       
# ℹ 35 more rows

Exercise 2

https://github.com/leonardblaschek/data_wrangling

Data analysis

data_full <- data_filtered |>
  select(
    Sample,
    Replicate,
    "Amount" = `Amount (based on standard curve)`,
    "Weight" = Weighed,
    "Proportion" = `Weight%`
  ) |>
  separate_wider_delim(
    Sample,
    names = c("Genotype", "Biorep"),
    delim = " "
  )

One row in, on row out: mutate()

data_mutated <- data_full |>
  mutate("Percent" = (Amount / Weight) * 100)

data_mutated
# A tibble: 30 × 7
   Genotype Biorep Replicate Amount Weight Proportion Percent
   <chr>    <chr>      <dbl>  <dbl>  <dbl>      <dbl>   <dbl>
 1 Zfp2-1   1              1  1.06       8     0.132    13.2 
 2 Zfp2-1   1              2  1.01       5     0.202    20.2 
 3 Zfp2-1   1              3  0.751      5     0.150    15.0 
 4 Zfp2-1   2              1  1.12       6     0.186    18.6 
 5 Zfp2-1   2              2  0.816      5     0.163    16.3 
 6 Zfp2-1   2              3  1.28       6     0.214    21.4 
 7 Zfp2-1   3              1  0.644      6     0.107    10.7 
 8 Zfp2-1   3              2  1.19       6     0.198    19.8 
 9 Zfp2-1   3              3  1.12       6     0.186    18.6 
10 Zfp2-1   4              1  0.364      5     0.0729    7.29
# ℹ 20 more rows

One row in, on row out: mutate()

data_mutated <- data_full |>
  mutate("Percent" = (Amount / Weight) * 100)

ggplot(
  data_mutated,
  aes(x = Proportion, y = Percent)
) +
  geom_smooth(
    method = "lm",
    formula = "y ~x"
  ) +
  geom_point() +
  theme_classic()

Conditional assignment: case_when()

data_mutated <- data_full |>
  mutate("Transgenic" = case_when(
    Genotype %in% c("Zfp2-1", "Zfp2-2") ~ "Yes",
    .default = "No"
  ))

data_mutated
# A tibble: 30 × 7
   Genotype Biorep Replicate Amount Weight Proportion Transgenic
   <chr>    <chr>      <dbl>  <dbl>  <dbl>      <dbl> <chr>     
 1 Zfp2-1   1              1  1.06       8     0.132  Yes       
 2 Zfp2-1   1              2  1.01       5     0.202  Yes       
 3 Zfp2-1   1              3  0.751      5     0.150  Yes       
 4 Zfp2-1   2              1  1.12       6     0.186  Yes       
 5 Zfp2-1   2              2  0.816      5     0.163  Yes       
 6 Zfp2-1   2              3  1.28       6     0.214  Yes       
 7 Zfp2-1   3              1  0.644      6     0.107  Yes       
 8 Zfp2-1   3              2  1.19       6     0.198  Yes       
 9 Zfp2-1   3              3  1.12       6     0.186  Yes       
10 Zfp2-1   4              1  0.364      5     0.0729 Yes       
# ℹ 20 more rows

Calculations within groups: group_by()

data_grouped <- data_full |>
  group_by(Genotype, Biorep)

data_grouped
# A tibble: 30 × 6
# Groups:   Genotype, Biorep [10]
   Genotype Biorep Replicate Amount Weight Proportion
   <chr>    <chr>      <dbl>  <dbl>  <dbl>      <dbl>
 1 Zfp2-1   1              1  1.06       8     0.132 
 2 Zfp2-1   1              2  1.01       5     0.202 
 3 Zfp2-1   1              3  0.751      5     0.150 
 4 Zfp2-1   2              1  1.12       6     0.186 
 5 Zfp2-1   2              2  0.816      5     0.163 
 6 Zfp2-1   2              3  1.28       6     0.214 
 7 Zfp2-1   3              1  0.644      6     0.107 
 8 Zfp2-1   3              2  1.19       6     0.198 
 9 Zfp2-1   3              3  1.12       6     0.186 
10 Zfp2-1   4              1  0.364      5     0.0729
# ℹ 20 more rows

Many rows in, one row out: summarise()

data_summarised <- data_full |>
  group_by(Genotype, Biorep) |>
  summarise(
    Replicate_mean = mean(Proportion)
  )

data_summarised
# A tibble: 10 × 3
# Groups:   Genotype [2]
   Genotype Biorep Replicate_mean
   <chr>    <chr>           <dbl>
 1 Zfp2-1   1               0.162
 2 Zfp2-1   2               0.188
 3 Zfp2-1   3               0.164
 4 Zfp2-1   4               0.133
 5 Zfp2-1   5               0.148
 6 Zfp2-2   1               0.242
 7 Zfp2-2   2               0.175
 8 Zfp2-2   3               0.210
 9 Zfp2-2   4               0.187
10 Zfp2-2   5               0.167

Many rows in, one row out: summarise()

data_summarised <- data_full |>
  group_by(Genotype, Biorep) |>
  summarise(
    Replicate_mean = mean(Proportion)
  ) |>
  group_by(Genotype) |>
  summarise(
    Genotype_mean = mean(Replicate_mean)
  )

data_summarised
# A tibble: 2 × 2
  Genotype Genotype_mean
  <chr>            <dbl>
1 Zfp2-1           0.159
2 Zfp2-2           0.196

Like a loop, but better: map()

lac_data <- list.files(
  path = "data/lac_data/",
  pattern = ".csv",
  full.names = TRUE
) |>
  read_csv(id = "file")

lac_data
# A tibble: 1,800 × 7
   file  stained_absorbance unstained_absorbance stained_hue unstained_hue image
   <chr>              <dbl>                <dbl>       <dbl>         <dbl> <chr>
 1 data…              0.214                0.196         115           119 021-…
 2 data…              0.211                0.196         117           182 021-…
 3 data…              0.182                0.173         124           147 021-…
 4 data…              0.271                0.277         122           200 021-…
 5 data…              0.211                0.225         119           170 021-…
 6 data…              0.316                0.316         121           181 021-…
 7 data…              0.308                0.259         112           170 021-…
 8 data…              0.329                0.26          118           227 021-…
 9 data…              0.353                0.314         117           143 021-…
10 data…              0.265                0.253         119           149 021-…
# ℹ 1,790 more rows
# ℹ 1 more variable: cell_type <chr>

Like a loop, but better: map()

lac_summarised <- lac_data |>
  mutate(
    image = basename(file)
  ) |>
  separate_wider_delim(
    image,
    delim = "_",
    names = c("date", "genotype", "replicate"),
    too_many = "drop"
  ) |>
  mutate(intensity = stained_absorbance - unstained_absorbance) |>
  group_by(genotype, cell_type, replicate) |>
  summarise("mean_intensity" = mean(intensity)) |>
  select(genotype, cell_type, mean_intensity)


lac_summarised
# A tibble: 90 × 3
# Groups:   genotype, cell_type [18]
   genotype cell_type mean_intensity
   <chr>    <chr>              <dbl>
 1 Q        BG              -0.00135
 2 Q        BG              -0.00155
 3 Q        BG              -0.00125
 4 Q        BG              -0.0032 
 5 Q        BG              -0.00285
 6 Q        IF               0.0042 
 7 Q        IF               0.0188 
 8 Q        IF               0.0142 
 9 Q        IF               0.0224 
10 Q        IF               0.0511 
# ℹ 80 more rows

Like a loop, but better: map()

lac_nested <- lac_summarised |>
  nest("data" = c(genotype, mean_intensity))

lac_nested
# A tibble: 9 × 2
# Groups:   cell_type [9]
  cell_type data               
  <chr>     <list>             
1 BG        <gropd_df [10 × 2]>
2 IF        <gropd_df [10 × 2]>
3 IF-CML    <gropd_df [10 × 2]>
4 LP        <gropd_df [10 × 2]>
5 MX        <gropd_df [10 × 2]>
6 PH        <gropd_df [10 × 2]>
7 PX        <gropd_df [10 × 2]>
8 XF        <gropd_df [10 × 2]>
9 XF-CML    <gropd_df [10 × 2]>
library(broom)

lac_p <- lac_nested |>
  mutate(
    "t_test" = map(data, \(x) t.test(mean_intensity ~ genotype, data = x)),
    "t_test_tidy" = map(t_test, tidy),
    "p_value" = map_dbl(t_test_tidy, \(x) pull(x, "p.value"))
  )

lac_p
# A tibble: 9 × 5
# Groups:   cell_type [9]
  cell_type data                t_test  t_test_tidy          p_value
  <chr>     <list>              <list>  <list>                 <dbl>
1 BG        <gropd_df [10 × 2]> <htest> <tibble [1 × 10]> 0.344     
2 IF        <gropd_df [10 × 2]> <htest> <tibble [1 × 10]> 0.00000231
3 IF-CML    <gropd_df [10 × 2]> <htest> <tibble [1 × 10]> 0.00000343
4 LP        <gropd_df [10 × 2]> <htest> <tibble [1 × 10]> 0.00209   
5 MX        <gropd_df [10 × 2]> <htest> <tibble [1 × 10]> 0.0507    
6 PH        <gropd_df [10 × 2]> <htest> <tibble [1 × 10]> 0.0205    
7 PX        <gropd_df [10 × 2]> <htest> <tibble [1 × 10]> 0.106     
8 XF        <gropd_df [10 × 2]> <htest> <tibble [1 × 10]> 0.000294  
9 XF-CML    <gropd_df [10 × 2]> <htest> <tibble [1 × 10]> 0.00106   

Exercise 3

https://github.com/leonardblaschek/data_wrangling

When you’re stuck

  1. Know which package/function you need? — Help pages and vignettes!
  2. Know what you want to do but not where to start? — Try an LLM, e.g. perplexity.ai
  3. I feel like I’ve done this before … — Keep your old scripts organised and annotated

Resources to go further